Smoothing

Rafael A. Irizarry

2008 Elections

polls_2008 %>% select(start_date, Obama, McCain)
Source: local data frame [543 x 3]

   start_date Obama McCain
       (time) (dbl)  (dbl)
1  2008-11-03  0.52   0.43
2  2008-11-02  0.49   0.44
3  2008-11-01  0.51   0.46
4  2008-11-01  0.52   0.44
5  2008-11-01  0.52   0.46
6  2008-11-01  0.53   0.45
7  2008-10-31  0.54   0.43
8  2008-10-30  0.52   0.44
9  2008-11-02  0.53   0.44
10 2008-11-01  0.51   0.43
..        ...   ...    ...

Difference versus day

[Figure]

Not a good fit

[Figure]

\( f(x) = \mbox{E}(Y \mid X=x) \) is not a line.

Bin Smoother

[Figure: bin smoother]

Bin Smoother

[Figure: bin_smoother1]

Bin Smoother

[Figure]
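A bin smoother like the one above can be sketched in a few lines of base R. The simulated data below is a hypothetical stand-in for the poll differences used in these slides:

```r
# Toy bin smoother: estimate f(x) = E(Y | X = x) by averaging Y
# inside non-overlapping bins of X. Simulated data, not the polls.
set.seed(1)
n <- 250
X <- runif(n, 0, 1)
Y <- sin(2 * pi * X) + rnorm(n, sd = 0.3)

bin_width <- 0.1
bins  <- cut(X, breaks = seq(0, 1, by = bin_width), include.lowest = TRUE)
f_hat <- tapply(Y, bins, mean)  # one estimate per bin
```

The estimate is a step function: it is constant within each bin and jumps at the bin boundaries, which is why the fit looks blocky.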

Kernels

[Figure: bin_smoother2]

Kernels

[Figure]
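Base R's `ksmooth()` implements both a box kernel (a moving-window average, like the bin smoother but centered at every point) and a normal kernel that down-weights points far from the center. Again the data is a simulated stand-in:

```r
set.seed(1)
X <- runif(250, 0, 1)
Y <- sin(2 * pi * X) + rnorm(250, sd = 0.3)

# box kernel: every point in the window gets equal weight
fit_box    <- ksmooth(X, Y, kernel = "box",    bandwidth = 0.1)
# normal kernel: weights decay smoothly, so the curve has no jumps
fit_normal <- ksmooth(X, Y, kernel = "normal", bandwidth = 0.1)
```

`fit_normal$y` is smoother than `fit_box$y` because points do not enter and leave the window abruptly as it slides along.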

Loess

[Figure]

Loess

[Figure]

Loess

[Figure: loess]

Loess

[Figure]
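Fitting loess in R is one call to `loess()`; `span` controls the fraction of the data used in each local fit and `degree` whether local lines or local parabolas are fit. Simulated stand-in data again:

```r
set.seed(1)
X <- runif(250, 0, 1)
Y <- sin(2 * pi * X) + rnorm(250, sd = 0.3)

# span = 0.25: each local fit uses the nearest 25% of the data
# degree = 1: fit local lines (degree = 2, the default, fits parabolas)
fit   <- loess(Y ~ X, span = 0.25, degree = 1)
grid  <- data.frame(X = seq(0.05, 0.95, 0.01))
f_hat <- predict(fit, newdata = grid)
```

Larger spans give smoother, more biased curves; smaller spans follow the data more closely but are noisier, which is what the next slide's comparison shows.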

Different spans

[Figure: loess]

Different spans

[Figure]

Beware of Defaults

ggplot(dat, aes(X, Y)) + geom_point(shape=1, cex=5) +
  geom_smooth(color="red")

[Figure]
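`geom_smooth()` picks its method from the sample size (loess for small n, gam for large n), so when the default curve looks wrong, state the method and span explicitly. The span value below is illustrative, not the slides' choice, and the data is a simulated stand-in:

```r
library(ggplot2)

set.seed(1)
dat <- data.frame(X = runif(250, 0, 1))
dat$Y <- sin(2 * pi * dat$X) + rnorm(250, sd = 0.3)

# same plot as above, but with the smoothing method made explicit
p <- ggplot(dat, aes(X, Y)) +
  geom_point(shape = 1, cex = 5) +
  geom_smooth(method = "loess", span = 0.25, color = "red")
```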

Distance

Clustering of animals.

Distance

The Euclidean distance between \( A \) and \( B \) is simply:

\[ \sqrt{ (A_x-B_x)^2 + (A_y-B_y)^2} \]
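In R the formula is one line; the two points here are an arbitrary example (a 3-4-5 right triangle):

```r
A <- c(x = 0, y = 0)
B <- c(x = 3, y = 4)
sqrt(sum((A - B)^2))  # sqrt(3^2 + 4^2) = 5
```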

Distance in High Dimensions

sample_n(train_set,10) %>% select(label, pixel351:pixel358) %>% kable
label  pixel351  pixel352  pixel353  pixel354  pixel355  pixel356  pixel357  pixel358
-----  --------  --------  --------  --------  --------  --------  --------  --------
    2         3       191       253       133         0         0         0         0
    7         0         0       179       253       243         8         0         0
    2       255       170         0         0         0         0         0         0
    7         0         0        42       254       254       140         0         0
    2       254       233         0         0         0         0         0         0
    2         0         0        55       248       253       247       115         0
    7         0         0         0         0        39       253       253         0
    7         0         0         0         0        14       201       128         0
    2         0         0       169       254       253       254        84         0
    7         5         0        36       253       252       196         0         0

Distance between observations

\[ \mbox{dist}(1,2) = \sqrt{ \sum_{j=1}^{784} (Y_{1,j}-Y_{2,j})^2 } \]

Distance between features

\[ \mbox{dist}(15,273) = \sqrt{ \sum_{i=1}^{500} (Y_{i,15}-Y_{i,273})^2 } \]
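`dist()` computes distances between rows, so distances between features (columns) require transposing first. A small random matrix stands in for the 500 × 784 pixel matrix here:

```r
set.seed(1)
X <- matrix(rnorm(500 * 20), nrow = 500)  # 20 features instead of 784

d_features <- dist(t(X))      # pairwise distances between the 20 columns
as.matrix(d_features)[15, 3]  # distance between features 15 and 3

# the same value computed directly from the definition
sqrt(sum((X[, 15] - X[, 3])^2))
```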

Matrices with R

X <- select(train_set , pixel0:pixel783) %>% as.matrix()

Rows and columns of matrices can be accessed like this:

third_row <- X[3,]
tenth_column <- X[,10]

Distance in R

X_1 <- X[1,]
X_2 <- X[2,]
X_253 <- X[253,]
sqrt(sum((X_1-X_2)^2))
[1] 2918.257
sqrt(sum((X_1-X_253)^2))
[1] 2787.913

Full Distance Matrix

d <- dist(X)
as.matrix(d)[1,2]
[1] 2918.257
as.matrix(d)[1,153]
[1] 3058.504

Image of distance

image(as.matrix(d))

[Figure]

k Nearest Neighbors

Logistic regression is a straw man:

tmp_data <- select(train_set, y, X_1, X_2)
glm_fit <- glm(y ~ ., data = tmp_data, family = "binomial")
f_hat <- predict(glm_fit, newdata = test_set, 
                 type = "response")
tab <- table(pred=round(f_hat), truth=test_set$y)
confusionMatrix(tab)$overall["Accuracy"]
Accuracy 
   0.794 

kNN with k=5

knn_fit <- knn3(y ~ ., data = select(train_set, y, X_1, X_2), k = 5)
f_hat <- predict(knn_fit, newdata = test_set)[,2]
tab <- table(pred=round(f_hat), truth=test_set$y)
confusionMatrix(tab)$overall["Accuracy"]
Accuracy 
   0.822 

kNN, k=5

[Figure]

Over-Training

f_hat <- predict(knn_fit, newdata = test_set)
tab <- table(pred=round(f_hat[,2]), 
             truth=test_set$y)
confusionMatrix(tab)$overall["Accuracy"]
Accuracy 
   0.822 
f_hat_train <- predict(knn_fit, 
                       newdata = train_set)
tab <- table(pred=round(f_hat_train[,2]), 
             truth=train_set$y)
confusionMatrix(tab)$overall["Accuracy"]
Accuracy 
   0.884 

Over-Training

Training set:

tmp_data <- select(train_set, y, X_1, X_2)
knn_fit_1 <- knn3(y~., data = tmp_data, k=1)
f_hat <- predict(knn_fit_1, 
                 newdata = train_set)
tab <- table(pred=round(f_hat[,2]), 
             truth=train_set$y)
confusionMatrix(tab)$overall["Accuracy"]
Accuracy 
       1 

Over-Training

Test set:

f_hat <- predict(knn_fit_1,newdata =test_set)
tab <- table(pred=round(f_hat[,2]), 
             truth=test_set$y)
confusionMatrix(tab)$overall["Accuracy"]
Accuracy 
    0.81 

kNN, k=1

[Figure]

Over-Smooth

knn_fit_251 <- knn3(y~.,data = select(train_set, y, X_1, X_2), k=251)
f_hat <- predict(knn_fit_251, newdata = test_set)[,2]
tab <- table(pred=round(f_hat), truth=test_set$y)
confusionMatrix(tab)$overall["Accuracy"]
Accuracy 
     0.8 

Over-Smooth

[Figure]

Accuracy for several k

control <- trainControl(method='cv', number=2, p=.5)
dat2 <- mutate(dat, label=as.factor(label)) %>%
  select(label,X_1,X_2)
res <- train(label ~ .,
             data = dat2,
             method = "knn",
             trControl = control,
             tuneLength = 1, # ignored here: tuneGrid below sets the grid
             tuneGrid=data.frame(k=seq(3,151,2)),
             metric="Accuracy")
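The same idea without caret: loop over k, fit on a training half, and score on a held-out half. This uses `class::knn` (shipped with R) and a simulated two-feature dataset standing in for the slides' `train_set`; illustrative only:

```r
library(class)  # recommended package, ships with R

set.seed(1)
n <- 400
X <- matrix(rnorm(2 * n), ncol = 2)
y <- factor(as.numeric(X[, 1] + X[, 2] + rnorm(n, sd = 0.5) > 0))

idx <- sample(n, n / 2)  # 50/50 split, like p = .5 above
ks  <- seq(3, 51, 2)
acc <- sapply(ks, function(k) {
  pred <- knn(X[idx, ], X[-idx, ], y[idx], k = k)
  mean(pred == y[-idx])     # held-out accuracy for this k
})
ks[which.max(acc)]          # k with the best held-out accuracy
```

Accuracy typically rises as k grows past 1 (less over-training), peaks, then falls as the fit over-smooths, which is the shape of the curve on the next slide.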

Accuracy for several k

[Figure]

kNN, k=11

[Figure]

Visualizing Results

Correctly called 2 with high probability

[Figure]

Incorrectly called 2 but had high probability

[Figure]

Predictor was about 50-50

[Figure]

Correctly called 7 with high probability

[Figure]

Incorrectly called 7 but had high probability

[Figure]